On Dataless Hierarchical Text Classification
نویسندگان
چکیده
In this paper, we systematically study the problem of dataless hierarchical text classification. Unlike standard text classification schemes that rely on supervised training, dataless classification depends on understanding the labels of the sought after categories and requires no labeled data. Given a collection of text documents and a set of labels, we show that understanding the labels can be used to accurately categorize the documents. This is done by embedding both labels and documents in a semantic space that allows one to compute meaningful semantic similarity between a document and a potential label. We show that this scheme can be used to support accurate multiclass classification without any supervision. We study several semantic representations and show how to improve the classification using bootstrapping. Our results show that bootstrapped dataless classification is competitive with supervised classification with thousands of labeled examples.
منابع مشابه
Importance of Semantic Representation: Dataless Classification
Traditionally, text categorization has been studied as the problem of training of a classifier using labeled data. However, people can categorize documents into named categories without any explicit training because we know the meaning of category names. In this paper, we introduce Dataless Classification, a learning protocol that uses world knowledge to induce classifiers without the need for ...
متن کاملJoint Embedding of Hierarchical Categories and Entities for Concept Categorization and Dataless Classification
Existing work learning distributed representations of knowledge base entities has largely failed to incorporate rich categorical structure, and is unable to induce category representations. We propose a new framework that embeds entities and categories jointly into a semantic space, by integrating structured knowledge and taxonomy hierarchy from large knowledge bases. Our framework enables to c...
متن کاملMulti-label Dataless Text Classification with Topic Modeling
Manually labeling documents is tedious and expensive, but it is essential for training a traditional text classifier. In recent years, a few dataless text classification techniques have been proposed to address this problem. However, existing works mainly center on single-label classification problems, that is, each document is restricted to belonging to a single category. In this paper, we pro...
متن کاملCross-Lingual Dataless Classification for Many Languages
Dataless text classification [Chang et al., 2008] is a classification paradigm which maps documents into a given label space without requiring any annotated training data. This paper explores a crosslingual variant of this paradigm, where documents in multiple languages are classified into an English label space. We use CLESA (cross-lingual explicit semantic analysis) to embed both foreign lang...
متن کاملDataless Text Classification with Descriptive LDA
Manually labeling documents for training a text classifier is expensive and time-consuming. Moreover, a classifier trained on labeled documents may suffer from overfitting and adaptability problems. Dataless text classification (DLTC) has been proposed as a solution to these problems, since it does not require labeled documents. Previous research in DLTC has used explicit semantic analysis of W...
متن کامل